
Add benchmark metrics persistence #2339

Merged
ieaves merged 7 commits into containers:main from ramalama-labs:metrics
Jan 25, 2026

Conversation


@ieaves ieaves commented Jan 22, 2026

Summary by Sourcery

Introduce persistent benchmark result storage and a new CLI for viewing historical benchmarks, while enriching bench output formatting and configuration support.

New Features:

  • Add a benchmarks CLI command with subcommands (currently list) to view stored benchmark results in table or JSON formats with pagination.
  • Persist bench command outputs as structured benchmark records, capturing device, configuration, and result metadata for later inspection.
  • Add configuration support for benchmarks, including a configurable storage folder with sensible defaults.

Enhancements:

  • Extend the bench command to support selectable output formats (table or JSON) and render richer tabular benchmark summaries including model parameters and throughput.
  • Refine configuration file discovery via a cached helper and wire benchmark settings into the global config.
  • Adjust CLI help/tests and llama.cpp benchmark engine spec so llama-bench emits JSON suitable for structured parsing.

Documentation:

  • Document the new benchmarks configuration section and add a dedicated ramalama-benchmarks(1) man page, while updating existing man pages and examples to reference the new commands and options.

Tests:

  • Add unit tests for the benchmarks manager and config documentation expectations, and update existing bench-related e2e/system tests for the new table headers and help behavior.


sourcery-ai bot commented Jan 22, 2026

Reviewer's Guide

Adds persistent benchmark metrics collection, storage, and querying to RamaLama, including a new benchmarks CLI, schema types for benchmark records, utilities for parsing/printing results, and configuration support for benchmark storage.

File-Level Changes

Change | Details | Files
Extend bench CLI to support output format selection and normalize subcommand alias handling.
  • Add --format {table,json} option to ramalama bench with default table and document it in ramalama-bench.1.md.
  • Normalize the benchmark alias back to bench in post-parse setup so downstream logic only sees bench (see the sketch after this change's file list).
  • Adjust tests that assert bench help/usage so they account for the new option and updated column name expectations.
ramalama/cli.py
ramalama/transports/base.py
docs/ramalama-bench.1.md
test/e2e/test_bench.py
test/system/002-bench.bats
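
A minimal, self-contained sketch of the --format option plus alias-normalization pattern described above. It is illustrative only; the real wiring in ramalama/cli.py differs in structure.

```python
# Illustrative only: argparse leaves the typed alias in the namespace, so a
# post-parse step maps "benchmark" back to "bench" for downstream code.
import argparse

parser = argparse.ArgumentParser(prog="ramalama")
sub = parser.add_subparsers(dest="subcommand")
bench = sub.add_parser("bench", aliases=["benchmark"])
bench.add_argument("--format", choices=["table", "json"], default="table")

args = parser.parse_args(["benchmark", "--format", "json"])
if args.subcommand == "benchmark":
    args.subcommand = "bench"
print(args.subcommand, args.format)  # -> bench json
```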
Introduce a persistent benchmarks subsystem with schemas, storage manager, and utilities for parsing and printing run results.
  • Define dataclasses for device info, test configuration, llama-bench results, and benchmark records, plus factory helpers to create versioned objects.
  • Implement JSON/JSONL parsing utilities and a tabular printer that formats benchmark rows (model, params, backend, ngl, threads, test, t/s, etc.).
  • Implement BenchmarksManager to append benchmark records as JSONL in a configurable storage folder and to list all stored benchmarks (a minimal sketch follows this change's file list).
  • Add a specific MissingStorageFolderError (and alias MissingDBPathError) to signal misconfiguration of the benchmarks storage path.
ramalama/benchmarks/schemas.py
ramalama/benchmarks/utilities.py
ramalama/benchmarks/manager.py
ramalama/benchmarks/errors.py
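
A minimal sketch of what a JSONL-backed manager of this shape can look like. The class, file, and error names follow the description above; the record fields and method signatures are simplified placeholders, not the PR's actual API.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class BenchmarkRecordV1:
    # Trimmed stand-in for the real schema (device, configuration, result, ...).
    model: str = ""
    tokens_per_second: float = 0.0
    version: str = "v1"


class MissingStorageFolderError(Exception):
    """Raised when no benchmarks storage folder is configured."""


class BenchmarksManager:
    def __init__(self, storage_folder: str | None):
        if not storage_folder:
            raise MissingStorageFolderError("benchmarks storage folder is not configured")
        self.path = Path(storage_folder) / "benchmarks.jsonl"

    def append(self, record: BenchmarkRecordV1) -> None:
        # JSONL: one JSON object per line, appended so earlier runs are never rewritten.
        self.path.parent.mkdir(parents=True, exist_ok=True)
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(asdict(record)) + "\n")

    def list_all(self) -> list[BenchmarkRecordV1]:
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as fh:
            return [BenchmarkRecordV1(**json.loads(line)) for line in fh if line.strip()]
```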
Wire benchmark persistence into transport bench execution and add a CLI to inspect historical results.
  • Change transport bench to always request JSON from llama-bench, parse the JSON output, build BenchmarkRecordV1 instances, then either print as JSON or formatted table depending on --format (sketched after this change's file list).
  • On successful bench runs, conditionally save results via BenchmarksManager unless CONFIG.benchmarks.disable is true.
  • Add a new top-level benchmarks subcommand with a list subcommand supporting --limit, --offset, and --format {table,json}, and error handling for missing storage folders.
  • Exclude benchmarks from the generic help-invalid-arg check since it has subcommands and document the new command in the main ramalama(1) manpage and its own ramalama-benchmarks(1) page.
ramalama/cli.py
ramalama/transports/base.py
inference-spec/engines/llama.cpp.yaml
docs/ramalama.1.md
docs/ramalama-benchmarks.1.md
test/system/015-help.bats
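
A rough, self-contained sketch of the "parse JSON, then branch on --format" flow described above. The JSON key names (avg_ts, n_gpu_layers) are assumptions about llama-bench's output and the record type is a placeholder; only the overall shape mirrors the PR. Persisting the rows would then be one BenchmarksManager append per record, skipped when the benchmarks disable flag is set.

```python
import json
import subprocess
from dataclasses import asdict, dataclass


@dataclass
class BenchRow:
    model: str
    n_gpu_layers: int
    tokens_per_second: float


def bench(cmd: list[str], fmt: str = "table") -> list[BenchRow]:
    # The engine spec asks llama-bench for JSON so results can be parsed structurally.
    proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
    rows = [
        BenchRow(r.get("model", ""), r.get("n_gpu_layers", 0), r.get("avg_ts", 0.0))
        for r in json.loads(proc.stdout)
    ]
    if fmt == "json":
        # A concrete list of dicts, not a generator, so json.dumps can serialize it.
        print(json.dumps([asdict(r) for r in rows], indent=2, sort_keys=True))
    else:
        for r in rows:
            print(f"{r.model:<40} ngl={r.n_gpu_layers:<4} {r.tokens_per_second:8.2f} t/s")
    return rows
```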
Extend configuration to support benchmark-related settings and expose config file discovery paths.
  • Introduce a Benchmarks config dataclass with storage_folder (defaulting under the first existing default config dir) and disable flag, and add it as CONFIG.benchmarks (a simplified sketch follows this change's file list).
  • Add get_default_benchmarks_storage_folder and get_config_file_path helpers to centralize configuration directory/file resolution.
  • Adjust load_file_config to track the list of parsed config files and expose them under settings.config_files.
  • Update unit tests to require documentation of the new benchmarks config section and tweak error messages to inline lists in backticks.
  • Document the ramalama.benchmarks table in docs/ramalama.conf and docs/ramalama.conf.5.md, including a configurable db path example.
ramalama/config.py
test/unit/test_config_documentation.py
docs/ramalama.conf
docs/ramalama.conf.5.md
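
A simplified sketch of the configuration surface described above. The dataclass and helper names echo the reviewer's guide, but the default path shown here is an assumption; the PR derives it from RamaLama's existing config-directory resolution.

```python
from dataclasses import dataclass, field
from pathlib import Path


def get_default_benchmarks_storage_folder() -> Path:
    # Placeholder default; the real helper walks the default config/store dirs.
    return Path.home() / ".local" / "share" / "ramalama" / "benchmarks"


@dataclass
class Benchmarks:
    storage_folder: Path = field(default_factory=get_default_benchmarks_storage_folder)
    disable: bool = False


@dataclass
class Config:
    benchmarks: Benchmarks = field(default_factory=Benchmarks)


CONFIG = Config()
print(CONFIG.benchmarks.storage_folder, CONFIG.benchmarks.disable)
```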
Refine bench output semantics and minor code cleanup.
  • Change displayed bench table column header from size to params and update regex-based tests accordingly.
  • Update llama.cpp bench command spec to explicitly pass JSON output and common performance-relevant flags instead of using the previous shared options anchor.
  • Perform small type and style cleanups (dict type hint modernization, remove stray blank line in CommandFactory, improve documentation error messages).
test/e2e/test_bench.py
test/system/002-bench.bats
inference-spec/engines/llama.cpp.yaml
ramalama/transports/base.py
ramalama/command/factory.py
test/unit/test_config_documentation.py


@gemini-code-assist
Contributor

Summary of Changes

Hello @ieaves, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the benchmarking capabilities of the ramalama CLI by introducing a dedicated command for managing historical benchmark results. It provides a structured way to store, retrieve, and display performance metrics, allowing users to track and analyze model performance over time. The changes integrate seamlessly with existing benchmarking workflows and offer flexible output options.

Highlights

  • New 'benchmarks' command: Introduced a new ramalama benchmarks command to view and interact with historical benchmark results, stored in a local SQLite database.
  • Enhanced 'bench' command: The ramalama bench command now supports a --format option (table or json) for output and automatically saves benchmark results to the new metrics storage.
  • Structured Benchmark Data: New Python modules (errors.py, manager.py, schemas.py, utilities.py) have been added under ramalama/benchmarks to define data structures for benchmark records, device information, and test configurations, and to manage their storage and retrieval.
  • Configuration for Benchmarks: A new [ramalama.benchmarks] section has been added to ramalama.conf and its man page, allowing users to configure the db_path for the benchmark results database.
  • Llama.cpp Integration: The llama.cpp inference engine configuration has been updated to output benchmark results in JSON format, facilitating structured data capture.


@ieaves ieaves changed the title from "Metrics" to "Add benchmark metrics persistence" on Jan 22, 2026

@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 1 security issue, 5 other issues, and left some high level feedback:

Security issues:

  • Detected subprocess function 'CompletedProcess' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'. (link)

General comments:

  • In `benchmarks_list_cli`, when `args.format == 'json'` you pass a generator (`(asdict(item) for item in results)`) to `json.dumps`, which will fail serialization; convert to a list (e.g., `[asdict(item) for item in results]`) before dumping.
  • In `normalize_benchmark_record`, the error message uses `type(BenchmarkRecord)` instead of the actual instance type; update it to `type(benchmark)` so the raised `NotImplementedError` reports the correct offending type.
  • The `ramalama-benchmarks` man page describes results being stored in a SQLite database with a `db_path`, but `BenchmarksManager` currently writes JSONL to `benchmarks.jsonl`; consider aligning the implementation with the documented SQLite behavior or adjusting the configuration naming to avoid confusion.
Individual Comments

### Comment 1
<location> `ramalama/cli.py:572-574` </location>
<code_context>
+            print("No benchmark results found")
+            return
+
+        if args.format == "json":
+            output = (asdict(item) for item in results)
+            print(json.dumps(output, indent=2, sort_keys=True))
+        else:
+            print_bench_results(results)
</code_context>

<issue_to_address>
**issue (bug_risk):** JSON output path builds a generator, which `json.dumps` cannot serialize.

`output` is a generator (`(asdict(item) for item in results)`), which `json.dumps` cannot serialize and will raise a `TypeError`. Please convert to a concrete structure first, e.g. `output = [asdict(item) for item in results]`, or switch to a streaming JSON approach.
</issue_to_address>
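
A tiny stand-alone repro of the failure mode and the suggested fix (the Row dataclass is a placeholder):

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Row:
    model: str = "m"


results = [Row(), Row()]

# json.dumps((asdict(r) for r in results))  # TypeError: Object of type generator is not JSON serializable
print(json.dumps([asdict(r) for r in results], indent=2, sort_keys=True))  # works
```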

### Comment 2
<location> `ramalama/benchmarks/schemas.py:114` </location>
<code_context>
+    configuration: TestConfigurationV1
+    result: LlamaBenchResultV1
+    version: Literal["v1"] = "v1"
+    created_at: str = datetime.now(timezone.utc).isoformat()
+    device: DeviceInfoV1 = field(default_factory=DeviceInfoV1.current_device_info)
+
</code_context>

<issue_to_address>
**issue (bug_risk):** `created_at` default is evaluated at import time, so all records share the same timestamp.

Because this default is evaluated at class definition time, every `BenchmarkRecordV1` created without an explicit `created_at` will have the same timestamp. Use a `default_factory`, e.g. `field(default_factory=lambda: datetime.now(timezone.utc).isoformat())`, to generate a fresh value per instance.
</issue_to_address>
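
The difference is easy to demonstrate in isolation; these two toy dataclasses are placeholders, not the PR's schema:

```python
import time
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class SharedTimestamp:
    # Evaluated once, when the class body runs: every instance gets the same value.
    created_at: str = datetime.now(timezone.utc).isoformat()


@dataclass
class FreshTimestamp:
    # Evaluated per instance, as the review suggests.
    created_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())


a = SharedTimestamp()
time.sleep(0.01)
b = SharedTimestamp()
assert a.created_at == b.created_at  # same stale timestamp

c = FreshTimestamp()
time.sleep(0.01)
d = FreshTimestamp()
assert c.created_at != d.created_at  # fresh timestamp per record
```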

### Comment 3
<location> `ramalama/benchmarks/schemas.py:202-206` </location>
<code_context>
+    raise NotImplementedError(f"No supported benchmark schemas for version {version}")
+
+
+def normalize_benchmark_record(benchmark: BenchmarkRecord) -> BenchmarkRecordV1:
+    if isinstance(benchmark, BenchmarkRecordV1):
+        return benchmark
+
+    raise NotImplementedError(f"Received an unsupported benchmark record type {type(BenchmarkRecord)}")
</code_context>

<issue_to_address>
**issue (bug_risk):** Error message uses `type(BenchmarkRecord)` instead of the actual `benchmark` instance.

This will always report the base class, not the actual runtime type, which makes the error misleading. Using `type(benchmark)` would correctly show the unexpected concrete type received.
</issue_to_address>

### Comment 4
<location> `ramalama/benchmarks/schemas.py:40-47` </location>
<code_context>
+    container_runtime: str = ""
+    inference_engine: str = ""
+    version: Literal["v1"] = "v1"
+    runtime_args: dict[str, Any] | None = None
+
+
</code_context>

<issue_to_address>
**suggestion:** The declared type of `runtime_args` does not match how it is populated.

In `BaseTransport.bench`, `runtime_args` receives `cmd` (a `list[str]`), but it’s annotated as `dict[str, Any] | None`. Please align the annotation with actual usage (e.g., `list[str] | None` or `object`) or change the caller to pass a mapping instead of a list.

```suggestion
@dataclass
class TestConfigurationV1(TestConfiguration):
    """Container configuration metadata for a benchmark run."""

    container_image: str = ""
    container_runtime: str = ""
    inference_engine: str = ""
    version: Literal["v1"] = "v1"
    runtime_args: list[str] | None = None
```
</issue_to_address>

### Comment 5
<location> `ramalama/config.py:153-154` </location>
<code_context>
+    version: ClassVar[Any]
+
+
+@dataclass
+class DeviceInfoV1(DeviceInfo):
+    hostname: str
</code_context>

<issue_to_address>
**issue (bug_risk):** Config field name `storage_folder` conflicts with documented `db_path` key.

Unless there’s explicit aliasing between `db_path` and `storage_folder`, values set in the config file will be ignored and the default will always be used. Please either rename the field to match the documented key or introduce a compatibility mapping so existing configs continue to work as expected.
</issue_to_address>
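
One possible compatibility shim, purely as an assumption about how the mismatch could be bridged (not what the PR does): accept the documented db_path key and map it onto the storage_folder field when the [ramalama.benchmarks] table is loaded.

```python
def load_benchmarks_table(table: dict) -> dict:
    # Map the documented key onto the dataclass field name, preferring an
    # explicit storage_folder if both happen to be present.
    table = dict(table)
    if "db_path" in table and "storage_folder" not in table:
        table["storage_folder"] = table.pop("db_path")
    return table


print(load_benchmarks_table({"db_path": "/var/lib/ramalama/benchmarks"}))
# {'storage_folder': '/var/lib/ramalama/benchmarks'}
```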

### Comment 6
<location> `ramalama/transports/base.py:472` </location>
<code_context>
            result = subprocess.CompletedProcess(args=escaped_cmd, returncode=0, stdout="", stderr="")
</code_context>

<issue_to_address>
**security (python.lang.security.audit.dangerous-subprocess-use-audit):** Detected subprocess function 'CompletedProcess' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'.

*Source: opengrep*
</issue_to_address>
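
For reference, the stdlib helper for safely rendering a dynamic argv as a shell string is shlex.quote (the shlex.escape named in the tool message does not exist); constructing CompletedProcess from a Python list, as here, never passes through a shell at all.

```python
import shlex
import subprocess

cmd = ["llama-bench", "-m", "model name with spaces.gguf", "-o", "json"]
# A list argv is passed around as data, not interpreted by a shell.
result = subprocess.CompletedProcess(args=cmd, returncode=0, stdout="", stderr="")
# If the command ever needs to be displayed or logged as a shell string:
print(" ".join(shlex.quote(part) for part in cmd))
```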



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant new feature for persisting and viewing benchmark results. It adds a new benchmarks CLI command, enhances the bench command with structured JSON output, and includes configuration and documentation for these changes. The implementation is comprehensive, covering data schemas, storage management, CLI integration, and testing.

My review focuses on ensuring correctness, maintainability, and consistency between the code and its documentation. I've identified a few issues, including documentation inaccuracies regarding the storage mechanism (JSONL vs. SQLite), a potential bug with a mutable default in a dataclass, and an issue with JSON serialization of a generator. Addressing these points will improve the robustness and usability of this new feature.


ieaves commented Jan 22, 2026

@olliewalsh something went wonky with #2237 and I had to open a new PR. Sorry! This contains the suggested documentation changes and the refactor from SQLite to JSONL.


def get_default_benchmarks_storage_folder() -> Path:
    conf_dir = None
    for dir in DEFAULT_CONFIG_DIRS:
Collaborator


This should use the store dir; the config dir may not be writable.

Collaborator Author


👍. Ideally we'd be able to default the benchmarks storage directory to the user-specified store path in this case. Unfortunately, is_set doesn't currently support checking nested subfields, so I put it off and instead had it use the default store path as the default benchmarks base path.

The config is getting pretty complicated at this point. Do you think it'd be worth bringing something like pydantic in and managing this through the likes of BaseSettings?
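
Not part of this PR, but a rough illustration of the pydantic-settings idea floated here: nested benchmark settings with environment-variable overrides. It assumes the third-party pydantic-settings package, and the names and env-var scheme are placeholders.

```python
from pathlib import Path

from pydantic import BaseModel, Field
from pydantic_settings import BaseSettings, SettingsConfigDict


class BenchmarksSettings(BaseModel):
    storage_folder: Path | None = None
    disable: bool = False


class RamalamaSettings(BaseSettings):
    model_config = SettingsConfigDict(env_prefix="RAMALAMA_", env_nested_delimiter="__")

    benchmarks: BenchmarksSettings = Field(default_factory=BenchmarksSettings)


# e.g. RAMALAMA_BENCHMARKS__STORAGE_FOLDER=/tmp/benchmarks overrides the nested default
settings = RamalamaSettings()
print(settings.benchmarks)
```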


ieaves commented Jan 24, 2026

@olliewalsh This should be good to go if you're ready to merge.

@ieaves ieaves temporarily deployed to macos-installer January 24, 2026 22:15 — with GitHub Actions Inactive
ramalama/cli.py Outdated

except MissingStorageFolderError:
    print("Error: RAMALAMA__BENCHMARKS_STORAGE_FOLDER not configured")
    sys.exit(1)
Collaborator


Better to raise an exception and let main() handle the exit code

Collaborator Author


Might make sense to pull that code out of the try/except altogether in that case.

else:
    dry_run(cmd)

result = subprocess.CompletedProcess(args=cmd, returncode=0, stdout="", stderr="")
Collaborator


Is this necessary? Everything else just returns None

Collaborator Author

@ieaves ieaves Jan 25, 2026


It's not necessary; it just keeps the typing consistent without the early return. I'll switch to an early return instead, though, since it's cleaner.

Signed-off-by: Ian Eaves <ian.k.eaves@gmail.com>
…ords

Signed-off-by: Ian Eaves <ian.k.eaves@gmail.com>
Signed-off-by: Ian Eaves <ian.k.eaves@gmail.com>
Signed-off-by: Ian Eaves <ian.k.eaves@gmail.com>
Signed-off-by: Ian Eaves <ian.k.eaves@gmail.com>
@ieaves ieaves temporarily deployed to macos-installer January 25, 2026 04:18 — with GitHub Actions Inactive
Signed-off-by: Ian Eaves <ian.k.eaves@gmail.com>
@ieaves ieaves temporarily deployed to macos-installer January 25, 2026 04:54 — with GitHub Actions Inactive
Signed-off-by: Ian Eaves <ian.k.eaves@gmail.com>
@ieaves ieaves temporarily deployed to macos-installer January 25, 2026 07:55 — with GitHub Actions Inactive
@ieaves ieaves merged commit 739563a into containers:main Jan 25, 2026
40 checks passed